Abstract

Clustering is an essential problem in machine learning and data mining. One vital factor that 
impacts clustering performance is how to learn or design the data representation (or features). 
Fortunately, recent advances in deep learning can learn unsupervised features effectively, and 
have yielded state-of-the-art performance in many classification problems, such as character
recognition, object recognition and document categorization. However, little attention has been 
paid to the potential of deep learning for unsupervised clustering problems. In this paper, we 
propose a deep belief network with nonparametric clustering. As an unsupervised method, our 
model first leverages the advantages of deep learning for feature representation and dimension 
reduction. Then, it performs nonparametric clustering under a maximum margin framework, a
discriminative clustering model that can be trained online efficiently in the code space. Lastly,
model parameters are refined in the deep belief network. Thus, this model can learn features for
clustering and infer model complexity in a unified framework. The experimental results show
the advantage of our approach over competitive baselines. 


1 Introduction 

Clustering methods, such as k-means, Gaussian mixture models (GMM), spectral clustering and
nonparametric Bayesian methods, have been widely used in machine learning and data mining.
Among these methods, the nonparametric Bayesian model is one of the most promising approaches
for data clustering, because it can infer the model complexity from the data automatically.
To mine clusters or patterns from data, we can group them based on some notion of similarity. In
general, the similarity computed for clustering depends on the features describing the data. Thus,
feature representation is vital for successful clustering. As is common for clustering methods in general,
the presence of noisy and irrelevant features can degrade clustering performance, making feature
representation an important factor in cluster analysis. Moreover, different features may be relevant
or irrelevant in high-dimensional data, suggesting the need for feature learning.

Recent advances in deep learning have attracted great attention in dimension reduction
and classification problems. The advantages of deep learning are that it gives
mappings which can capture meaningful structure information in the code space and introduces
bias towards configurations of the parameter space that are helpful for unsupervised learning [6].
More specifically, it learns the composition of multiple non-linear transformations (such as stacked
restricted Boltzmann machines), with the purpose of yielding more abstract and ultimately more useful
representations [3]. In addition, deep learning with gradient descent scales linearly in time and space
with the number of training cases, which makes it applicable to large-scale data sets [9].

Unfortunately, little work has been done to leverage the advantages of deep learning for unsupervised
clustering problems. Moreover, unsupervised clustering also presents a challenge in the deep learning
framework, compared to supervised methods, in the final fine-tuning process. Another important
research topic in clustering analysis is how to adapt model complexity to increasing data volumes in the
era of big data. However, most approaches are generative models and have restrictions
on the prior base measures.

In this paper, we are interested in clustering problems and propose a deep belief network (DBN)
with nonparametric clustering. This approach is an unsupervised clustering method, inspired by
the advances in unsupervised feature learning with DBN, as well as nonparametric Bayesian models.
On the one hand, clustering performance depends heavily on data representation, which
implies the need for feature learning in clustering. On the other hand, while the nonparametric
Bayesian model can perform model selection and data clustering, it is intractable for non-conjugate
priors; furthermore, it may not perform well on high-dimensional data, especially in terms of space
and time complexity. Thus, we propose deep learning with a nonparametric maximum margin
model for clustering analysis. Essentially, we first pre-train a DBN for feature learning and dimension
reduction. Then, we learn the clustering weights discriminatively with nonparametric maximum
margin clustering (NMMC), which can be updated online efficiently. Finally, we fine-tune the model
parameters in the deep belief network. Refer to Fig. 1 for a visual overview of our model.
Hence, our framework can handle high-dimensional input features with nonlinear mapping, and
cluster large-scale data sets with model selection using the online nonparametric clustering method.

Our contributions can be summarized as: (1) leveraging unsupervised feature learning with
DBN for clustering analysis; (2) a discriminative approach for nonparametric clustering under a
maximum margin framework. The experimental results show advantages of our model over competitive
baselines.

2 Related work 

Clustering has been an active research topic for decades, covering a wide range of techniques,
such as generative/discriminative and parametric/nonparametric approaches. As a discriminative
method, maximum margin clustering (MMC) treats the label of each instance as a latent variable
and uses SVM for clustering with large margins. However, existing MMC methods [2, 28] either cannot
learn parameters online efficiently or need the number of clusters to be defined in advance, like other
clustering approaches such as k-means, Gaussian mixture models (GMM) and spectral clustering.
Considering these weaknesses of parametric models, many nonparametric methods have been
proposed to handle the model complexity problem. One of the most widely used nonparametric models
for clustering is the Dirichlet process mixture (DPM) [1, 7]. DPM can learn the number of mixture
components without it being specified in advance, and this number can grow as new data come in. However, the
behavior of the model is sensitive to the choice of the prior base measure $G_0$. In addition, a DPM of
Gaussians needs to calculate the mean and covariance for each component, and update the covariance with
Cholesky decomposition, which may lead to high space and time complexity in high-dimensional
data.

Figure 1: In this DBN, $L$ indicates the total number of hidden layers, $W_i$ is the weight between
adjacent layers, for $i = 1, \dots, L$, and $\Theta$ is the weight for clustering learned with NMMC. This graph
demonstrates the 3 steps in our model: (1) feature learning with a deep belief network (DBN), with
weights learned layer by layer as described above; (2) clustering analysis with NMMC,
which assigns a cluster label to each element in the data; (3) updating the model parameters with
the fine-tuning process (only for $W_L$ and $\Theta$).

Unsupervised feature learning with deep structures was first proposed in [9] for dimension
reduction. Later, this unsupervised approach was developed into semi-supervised embedding
and supervised mapping scenarios. Many other supervised approaches also exploit deep learning
for feature extraction and then learn a discriminative classifier with objectives such as square loss [9],
logistic regression or support vector machine (SVM) [13, 23] for classification in the code space.
Deep learning succeeds because it can learn useful information for data visualization and
classification. Thus, it is desirable to leverage deep learning for clustering analysis, because
clustering performance depends heavily on data representation. Unfortunately, little attention
has been paid to leveraging deep learning for unsupervised clustering problems.

A recent interesting approach is the implicit mixture of RBMs (IMRBM) [17]. Instead of modeling each
component with a Gaussian distribution, it models each component with an RBM. It is formulated as a
third-order Boltzmann machine with the cluster label as the hidden variable for each instance. However,
it also requires the number of clusters to be specified as input.

In this paper, we are interested in deep learning for unsupervised clustering problems. In our
framework, we take advantage of deep learning for representation learning, which is helpful for
clustering analysis. Moreover, we take a discriminative approach, namely nonparametric maximum
margin clustering, to infer model complexity online, without the prior base measure assumption required by DPM.


3 Deep learning with nonparametric maximum margin clustering 


In this section, we first review RBM and DBN for feature learning. Then, we introduce the
nonparametric maximum margin clustering (NMMC) method given the features learned from DBN.
Finally, we fine-tune our model given the clustering labels for the data.


3.1 Feature learning with deep belief network 

Assume that we have a training set $\mathcal{D} = \{v_i\}_{i=1}^N$, where $v_i \in \mathbb{R}^d$. An RBM with $n$ hidden units is a
parametric model of the joint distribution between a layer of hidden variables $h = (h_1, \dots, h_n)$ and
the observations $v = (v_1, \dots, v_d)$. The RBM joint likelihood takes the form:

$$p(v, h) \propto e^{-E(v, h)} \tag{1}$$

where the energy function is

$$E(v, h) = -h^T W_1^T v - b^T v - c^T h \tag{2}$$

with $W_1$ indexed as (visible, hidden), so that it agrees with Eqs. (3b) and (3c). And we can compute the following conditional likelihoods:

$$p(v \mid h) = \prod_i p(v_i \mid h) \tag{3a}$$

$$p(v_i = 1 \mid h) = \mathrm{logistic}\Big(b_i + \sum_j W_1(i, j)\, h_j\Big) \tag{3b}$$

$$p(h_j = 1 \mid v) = \mathrm{logistic}\Big(c_j + \sum_i W_1(i, j)\, v_i\Big) \tag{3c}$$

where $\mathrm{logistic}(x) = 1/(1 + e^{-x})$. To learn the RBM parameters, we need to minimize the negative log
likelihood $-\log p(v)$ on the training data; the parameter updates can be calculated with an efficient
stochastic descent method, namely contrastive divergence (CD) [10].

A Deep Belief Network (DBN) is composed of stacked RBMs [9] learned layer by layer greedily,
where the top layer is an RBM and the lower layers can be interpreted as a directed sigmoid belief
network [3], as shown in Fig. 1. Suppose the DBN used here has $L$ layers, and the weight for each
layer is denoted $W_i$ for $i = 1, \dots, L$. Specifically, we regard an RBM as a 1-layer DBN, with weight
$W_1$. Thus, a DBN can learn a parametric nonlinear mapping from input $v$ to output $x$, $f: v \to x$.
For example, for a 1-layer DBN, we have $x = \mathrm{logistic}(W_1^T v + c)$. After we learn the representation
for the data, we use NMMC for clustering analysis to model the data distribution.
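
As an illustration of the pre-training step, the following is a minimal NumPy sketch of one CD-1 update for a binary RBM and of the encoding $x = \mathrm{logistic}(W_1^T v + c)$. It is not the authors' implementation: the function names, the mini-batch averaging, the toy data and the initialization scale are all assumptions made for this example.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(V, W, b, c, lr, rng):
    """One CD-1 update on a batch V (N x d) of binary rows.

    W is d x n (visible x hidden), b are visible biases, c are hidden biases.
    """
    # Positive phase: p(h = 1 | v) as in Eq. (3c), then sample the hidden units.
    ph0 = logistic(V @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(V.dtype)
    # Negative phase: reconstruct the visibles as in Eq. (3b), then go up again.
    pv1 = logistic(h0 @ W.T + b)
    ph1 = logistic(pv1 @ W + c)
    n_rows = V.shape[0]
    # Approximate gradient: data statistics minus reconstruction statistics.
    W = W + lr * (V.T @ ph0 - pv1.T @ ph1) / n_rows
    b = b + lr * (V - pv1).mean(axis=0)
    c = c + lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def encode(V, W, c):
    """Code-space representation x = logistic(W^T v + c) for each row of V."""
    return logistic(V @ W + c)

rng = np.random.default_rng(0)
V = (rng.random((64, 20)) < 0.3).astype(float)  # toy binary data, not MNIST
W = 0.01 * rng.standard_normal((20, 8))          # assumed initialization
b = np.zeros(20)
c = np.zeros(8)
for _ in range(10):
    W, b, c = cd1_step(V, W, b, c, lr=0.1, rng=rng)
X = encode(V, W, c)                               # codes passed on to clustering
```

For a deeper DBN, the same step would be repeated layer by layer, feeding the codes of one layer as the input of the next.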


3.2 Nonparametric maximum margin clustering 


Nonparametric maximum margin clustering (NMMC) is a discriminative clustering model for
clustering analysis. Given the nonlinear mapping with DBN, we can first map the original training data
$\mathcal{V} = \{v_i\}_{i=1}^N$ into codes $\mathcal{X} = \{x_i\}_{i=1}^N$ in the embedding space. Then, with $\mathcal{X} = \{x_i\}_{i=1}^N$ and its
cluster indicators $z = \{z_i\}_{i=1}^N$, we propose the following conditional probability for nonparametric
clustering:

$$P(z, \{\theta_k\}_{k=1}^K \mid \mathcal{X}) \propto p(z) \prod_{i=1}^N p(x_i \mid \theta_{z_i}) \prod_{k=1}^K p(\theta_k) \tag{4}$$

where $K$ is the number of clusters, $p(x_i \mid \theta_{z_i})$ is the likelihood term defined in Sec. 3.2.1, and $p(\theta_k)$
can be thought of as a Gaussian prior for $k \in [1, K]$. Note that the prior $p(\theta_k)$ will be used in
the maximum margin learning in Eq. (12). The term $p(z)$ is a symmetric Dirichlet
prior that depends on $n_k$, the number of elements in cluster $k$, and on the concentration parameter $\alpha$.


Recall that the Dirichlet process mixture (DPM) [1, 7] is a widely used nonparametric Bayesian
approach for clustering analysis and model learning, specified with the DP prior measure $G_0$ and $\alpha$. As
a joint likelihood model, it has to model $p(\mathcal{X})$, which is intractable for non-conjugate priors. The
essential difference between our model and DPM is that we maximize a conditional probability,
instead of a joint probability as in DPM. Moreover, our approach is a discriminative clustering
model with component parameters learned under a maximum margin framework.

To maximize the objective function in Eq. (4), we want higher within-cluster correlation and
lower correlation between different clusters. Given $z$, we need to learn $\{\theta_k\}_{k=1}^K$ to keep each
cluster as compact as possible, which in turn helps infer a better $K$. In other words, to keep the
objective climbing, we need a higher likelihood $p(x_i \mid \theta_{z_i})$ with higher within-cluster correlation, which
can be addressed with discriminative clustering. Given the component parameters $\{\theta_k\}_{k=1}^K$, we
need to decide the label for each element to obtain a better $K$. For each round (on the instance level), we
use Gibbs sampling to infer $z_i$ for each instance $x_i$, which in turn is used to estimate $\{\theta_k\}_{k=1}^K$
with online maximum margin learning. For each iteration (on the whole dataset), we also update
$\alpha$ with adaptive rejection sampling.


3.2.1 Gibbs sampling 

Given the data points $\mathcal{X} = \{x_i\}_{i=1}^N$ and their cluster indicators $z = \{z_i\}_{i=1}^N$, Gibbs sampling
involves iterations that alternately draw samples from each conditional probability while keeping the other
variables fixed. For each indicator variable $z_i$, we can derive its conditional posterior as follows:

$$p(z_i = k \mid z_{-i}, x_i, \{\theta_k\}_{k=1}^K, \alpha, \lambda) \tag{5}$$

$$= p(z_i = k \mid x_i, z_{-i}, \{\theta_k\}_{k=1}^K) \tag{6}$$

$$\propto p(z_i = k \mid z_{-i}, \{\theta_k\}_{k=1}^K)\, p(x_i \mid z_i = k, \{\theta_k\}_{k=1}^K) \tag{7}$$

$$= p(z_i = k \mid z_{-i}, \alpha)\, p(x_i \mid \theta_k) \tag{8}$$

where the subscript $-i$ indicates all indices except for $i$, $p(z_i = k \mid z_{-i}, \alpha)$ is determined by the Chinese
restaurant process, and $p(x_i \mid \theta_k)$ is the likelihood for the current observation $x_i$. For DPM, we need
to maximize the conditional posterior to compute $\theta_k$, which depends on the observations belonging to
this cluster and the prior $G_0$.
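
For reference, the Chinese restaurant process term takes the following standard form. The notation $n_{-i,k}$ (the number of instances other than $x_i$ currently assigned to cluster $k$) is introduced here for illustration and is not taken from the text above.

$$p(z_i = k \mid z_{-i}, \alpha) = \frac{n_{-i,k}}{N - 1 + \alpha} \quad \text{for an existing cluster } k, \qquad p(z_i = K + 1 \mid z_{-i}, \alpha) = \frac{\alpha}{N - 1 + \alpha} \quad \text{for a new cluster.}$$

The two cases sum to one over all choices, since $\sum_k n_{-i,k} = N - 1$. A larger $\alpha$ therefore puts more prior mass on opening a new cluster.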

In our conditional likelihood model, we define the following likelihood for instance $x_i$:

$$p(x_i \mid \theta_k) \propto \exp(x_i^T \theta_k - \lambda \|\theta_k\|^2) \tag{9}$$

where $\lambda$ is a regularization constant that controls the weights between the two terms above. By default,
the prediction is given by $\arg\max_k (x_i^T \theta_k)$, for $k \in [1, K]$. In other words,
higher correlation between $x_i$ and $\theta_k$ indicates higher probability that $x_i$ belongs to cluster $k$, which
further leads to a higher objective in Eq. (4). In our likelihood definition, we also subtract $\lambda \|\theta_k\|^2$
in Eq. (9), which keeps the beneficial maximum margin properties in the model, separating
clusters as far as possible. Another way to understand the above likelihood is that Eq. (9)
satisfies the general form of exponential families, which are functions solely of the chosen sufficient
statistics [22]. Thus, the probability assumption in Eq. (9) makes the model general enough for real applications.

Plugging Eq. (9) into Eq. (8), we get the final Gibbs sampling strategy for our model:

$$p(z_i = k \mid z_{-i}, x_i, \{\theta_k\}_{k=1}^K, \alpha, \lambda) \propto p(z_i = k \mid z_{-i}, \alpha) \exp(x_i^T \theta_k - \lambda \|\theta_k\|^2) \tag{10}$$


For a newly created cluster, we assume $\theta_{K+1}$ is sampled from a multivariate t-distribution.

We will introduce online maximum margin learning for the component parameters $\{\theta_k\}_{k=1}^K$ in Sec. 3.2.2.
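
The sampling rule in Eq. (10) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `sample_indicator`, the degrees of freedom `df` and the unit scale of the t-distribution are assumptions, since the text does not give those parameters.

```python
import numpy as np

def sample_indicator(x, counts, thetas, alpha, lam, rng, df=3.0):
    """Draw z_i for one code vector x, following the form of Eq. (10).

    counts[k] is the size of cluster k with x itself removed; thetas is K x n.
    Returns (k, thetas); thetas gains one row if a new cluster was opened.
    """
    # Candidate parameter for a new cluster: a multivariate t draw built as
    # normal / sqrt(chi2 / df). df and the unit scale are assumptions.
    g = rng.chisquare(df)
    theta_new = rng.standard_normal(x.shape[0]) / np.sqrt(g / df)
    cand = np.vstack([thetas, theta_new])
    # CRP prior weights: n_k for existing clusters, alpha for the new one.
    prior = np.append(counts, alpha).astype(float)
    # Log-likelihood term x^T theta_k - lam * ||theta_k||^2 from Eq. (9).
    loglik = cand @ x - lam * np.sum(cand ** 2, axis=1)
    with np.errstate(divide="ignore"):      # empty clusters get log(0) = -inf
        logp = np.log(prior) + loglik
    p = np.exp(logp - logp.max())           # subtract the max for stability
    p /= p.sum()
    k = int(rng.choice(len(p), p=p))
    if k == len(thetas):                    # the new cluster was chosen
        thetas = cand
    return k, thetas

rng = np.random.default_rng(0)
thetas = 0.1 * rng.standard_normal((3, 5))  # three existing clusters, toy values
x = rng.random(5)
k, thetas2 = sample_indicator(x, np.array([4, 2, 1]), thetas,
                              alpha=4.0, lam=15.0, rng=rng)
```

Working in log space avoids overflow in the exponential when $\lambda \|\theta_k\|^2$ is large.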


3.2.2 Online maximum margin learning 


We follow the passive aggressive (PA) algorithm [5] below in order to learn the component parameters
in our discriminative model with maximum margins.

We denote the instance presented to the algorithm on round $t$ by $x_t \in \mathbb{R}^n$, which is associated with a
unique label $z_t \in [1, K]$. Note that the label $z_t$ is determined by the above Gibbs sampling algorithm
in Eq. (10). We define $\Theta = [\theta_1, \dots, \theta_K]$ as a parameter vector formed by concatenating all the parameters
$\{\theta_k\}_{k=1}^K$ (that is, $\Theta^{z_t}$ is the $z_t$-th block in $\Theta$, or $\Theta^{z_t} = \theta_{z_t}$), and $\Phi(x_t, z_t)$ as a feature vector
relating input $x_t$ and output $z_t$, which is composed of $K$ blocks; all blocks but the $z_t$-th are set
to the zero vector, while the $z_t$-th block is set to $x_t$. We denote by $\Theta_t$ the weight vector used
by the algorithm on round $t$, and refer to the term $\gamma(\Theta_t; (x_t, z_t)) = \Theta_t \cdot \Phi(x_t, z_t) - \Theta_t \cdot \Phi(x_t, \hat{z}_t)$
as the (signed) margin attained on round $t$. In this paper, we use the hinge-loss function, which is
defined as follows:


$$\ell(\Theta; (x_t, z_t)) = \begin{cases} 0 & \text{if } \gamma(\Theta; (x_t, z_t)) \ge 1 \\ 1 - \gamma(\Theta; (x_t, z_t)) & \text{otherwise} \end{cases} \tag{11}$$


Following the passive aggressive (PA) algorithm [5], we optimize the objective function:

$$\Theta_{t+1} = \arg\min_{\Theta} \frac{1}{2} \|\Theta - \Theta_t\|^2 + C\xi \quad \text{s.t. } \ell(\Theta; (x_t, z_t)) \le \xi \tag{12}$$


where the $\ell_2$ norm of $\Theta$ on the right-hand side can be thought of as the Gaussian prior in Eq. (4). If
there is a loss, then the PA-I update has the following closed form:

$$\Theta_{t+1}^{z_t} = \Theta_t^{z_t} + \tau_t x_t, \qquad \Theta_{t+1}^{\hat{z}_t} = \Theta_t^{\hat{z}_t} - \tau_t x_t \tag{13}$$

where $\hat{z}_t$ is the label prediction for $x_t$, and $\tau_t = \min\{C, \ell(\Theta_t; (x_t, z_t)) / \|\Phi(x_t, z_t) - \Phi(x_t, \hat{z}_t)\|^2\}$, as in [5]. Note that the Gibbs sampling
step decides the indicator variable $z_t$ for $x_t$. Given this cluster label (treated as the ground truth assignment)
for $x_t$, we update our parameter $\Theta$ using Eq. (13). For convergence analysis and time
complexity, refer to [5].
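
A minimal sketch of one such online update is given below. It assumes the multiclass PA-I formulation in which $\hat{z}_t$ is the highest-scoring label other than $z_t$; the function name `pa1_update` and the K x n row layout of $\Theta$ are choices made for this example only.

```python
import numpy as np

def pa1_update(Theta, x, z, C):
    """One multiclass PA-I step.

    Theta is K x n (row k is theta_k), x is a nonzero code vector of length n,
    z is the label given by Gibbs sampling. Returns (updated Theta, hinge loss).
    """
    if Theta.shape[0] < 2:
        return Theta, 0.0                  # a single cluster: nothing to separate
    scores = Theta @ x
    masked = scores.copy()
    masked[z] = -np.inf
    z_hat = int(np.argmax(masked))         # strongest competing label (assumption)
    margin = scores[z] - scores[z_hat]
    loss = max(0.0, 1.0 - margin)          # hinge loss, Eq. (11)
    if loss > 0.0:
        # For the block feature map, ||Phi(x, z) - Phi(x, z_hat)||^2 = 2 ||x||^2.
        tau = min(C, loss / (2.0 * (x @ x)))
        Theta = Theta.copy()
        Theta[z] += tau * x                # pull the assigned block towards x
        Theta[z_hat] -= tau * x            # push the competing block away
    return Theta, loss

Theta0 = np.zeros((3, 4))
Theta1, loss1 = pa1_update(Theta0, np.ones(4), z=1, C=0.001)
```

With a small aggressiveness parameter such as $C = 0.001$, each step is capped at $\tau_t = C$, so a single noisy label from the sampler can only move the weights slightly.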
3.3 Fine-tuning the model


Having determined the number of clusters and labels for all training data, we can apply the fine-tuning
process to refine the DBN parameters. Note that the objective function in Eq. (12) uses the
hinge loss. Thus, one possible way is to take the sub-gradient and backpropagate
the error to update the DBN parameters. In our approach, we employ another method and only update
the top layer weights $W_L$ and $\Theta$ in the deep structures. This fine-tuning process is inspired by
the classification RBM [15] for model refining. Basically, we assume the top DBN layer weight $W_L$
and the SVM weight $\Theta$ can be combined into a classification RBM as in [15], by maximizing the joint
likelihood $p(x, z)$ after we infer the cluster labels for all instances with NMMC. Note that there is a
mapping from SVM scores to probabilistic outputs with the logistic function, which can maintain
label consistency between the SVM classifier and the softmax function. Thus, the SVM weight $\Theta$
can be used to initialize the weight of the softmax function in the classification RBM. After the
fine-tuning process, we can compute $\arg\max_z p(z \mid v)$ for $z \in [1, K]$ to label the unknown data $v$. For a 1-layer
DBN, we can get the following classification probability:

$$p(z \mid v) = \frac{e^{d_z} \prod_{j=1}^n \big(1 + e^{c_j + \Theta_{jz} + \sum_i W_1(i, j)\, v_i}\big)}{\sum_{z^*} e^{d_{z^*}} \prod_{j=1}^n \big(1 + e^{c_j + \Theta_{jz^*} + \sum_i W_1(i, j)\, v_i}\big)} \tag{14}$$

where $d_z$ for $z \in [1, K]$ is the bias of the clustering labels, and $c_j$ for $j \in [1, n]$ are the biases of the hidden
units. Note that $\Theta$ has been reshaped into an $n \times K$ matrix before updating in the fine-tuning process.
For a deep neural network with more than one layer, we first project $v$ into the coding space $x$,
then use the above equation for classification.
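
The classification probability in Eq. (14) can be evaluated stably in log space. The sketch below is illustrative only: the function name `cluster_posterior` and the argument layout are assumptions, and `np.logaddexp(0, a)` is used as a numerically stable $\log(1 + e^{a})$.

```python
import numpy as np

def cluster_posterior(x, W, c, Theta, d):
    """p(z | x) for a classification RBM, following the form of Eq. (14).

    x: input to the top RBM (length m), W: m x n weights, c: n hidden biases,
    Theta: n x K clustering weights, d: K label biases.
    """
    # Pre-activation of hidden unit j under label z:
    # c_j + Theta[j, z] + sum_i W[i, j] * x_i.
    act = (c + x @ W)[:, None] + Theta                   # shape n x K
    # Log of the numerator: d_z + sum_j log(1 + exp(act[j, z])).
    log_num = d + np.logaddexp(0.0, act).sum(axis=0)     # shape K
    log_num = log_num - log_num.max()                    # stabilize the softmax
    p = np.exp(log_num)
    return p / p.sum()

rng = np.random.default_rng(1)
p = cluster_posterior(rng.random(6), rng.standard_normal((6, 4)), np.zeros(4),
                      rng.standard_normal((4, 3)), np.zeros(3))
label = int(np.argmax(p))   # predicted cluster for this input
```

When $\Theta$ and $d$ are all zeros, every label receives the same score and the function returns a uniform distribution, which is a quick sanity check.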

In our algorithm, we only fine-tune the top layer for the following reasons: (1) the objective
function with deep feature learning is non-convex, and can easily be trapped in a local
minimum with L-BFGS [9]; (2) if there is a clustering error in the top layer, it can easily be
propagated in the backpropagation stage; (3) updating only the top layer can effectively handle
the overfitting problem.


4 Experimental Results 

In order to analyze our model, we performed clustering analysis on two types of data, images and
documents, and compared our results to competitive baselines. For all experiments, including
pre-training and fine-tuning, we set the learning rate to 0.1 and the maximum number of epochs to 100, and used
CD-1 to learn the weights and biases in the deep belief network. We used the adjusted Rand Index
[11, 20] to evaluate all the clustering results.
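
For readers who want to reproduce the metric, the adjusted Rand Index can be computed from the contingency table of two labelings. The sketch below uses the standard pair-counting form; it is not the authors' evaluation script, and the function name is chosen for this example.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    # Contingency table: table[a, b] = number of items with true a and predicted b.
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)

    def pairs(m):
        return m * (m - 1) / 2.0            # number of unordered pairs among m items

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(t))
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:               # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, renamed labels
cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that cut across each other
```

The index does not depend on how clusters are named, which is why it suits methods such as NMMC and DPM that may produce a number of clusters different from the ground truth.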

Clustering on MNIST dataset: The MNIST dataset¹ consists of 28 x 28 images of handwritten
digits from 0 through 9, with a training set of 60,000 examples and a test set of 10,000
examples, and has been widely used to test character recognition methods. In the experiment, we
randomly sample 5000 images from the training set for parameter learning and 1000 examples from
the testing set to test our model. After learning the features with DBN in the pre-training stage,
we used NMMC for clustering, setting $\alpha = 4$, $\lambda = 15$ and $C = 0.001$. In the experiment, $\lambda$
plays a vital role in the final number of clusters: the higher $\lambda$ is, the more clusters are generated. To
make a fair comparison, we tuned the parameters to keep the number of generated clusters
close to the ground truth in the training stage. For example, in the MNIST experiment, we keep it
between 5 and 20 on the training set for both NMMC and DPM. The results from baselines such as
k-means and GMM should be regarded as an upper bound, because they are given the number of clusters ($K = 10$).

¹ http://yann.lecun.com/exdb/mnist/

Figure 2: Visualization of the learned weights in the pre-training and fine-tuning stages, respectively,
with a 1-layer DBN for $n = 100$ on the MNIST dataset.

Figure 3: How the dimensionality and structural depth influence performance on the MNIST dataset:
(a) how the Rand Index changes with the encoded data dimension; (b) how the Rand Index changes
with the depth of the deep structure. It demonstrates that the fine-tuning process helps to improve
clustering performance. It also shows that more complex deep structures do not necessarily improve
clustering accuracy.

The clustering performance of our method (DBN+NMMC) is shown in Table 1, where “pre-train”
and “fine-tune” indicate how the accuracy changes before and after the fine-tuning process
for the same parameter setting on the same dataset. The results with a 2-layer DBN in Table 1
demonstrate that our method significantly outperforms the baselines. They also show that the fine-tuning
process can greatly improve accuracy, especially on the testing data. In Table 1, we think the
largest train/test difference, observed for the least complex model, is caused by the change in biases before and
after fine-tuning. In other words, the fine-tuning step can learn better biases via the classification RBM
and improve testing performance. We also visualize how the weights change before and after the
fine-tuning process in Fig. 2.

We also evaluate how the depth and dimensionality of deep structures influence clustering accuracy.
Fig. 3(a) shows how the adjusted Rand Index changes with the number of dimensions for a 1-layer DBN
(or RBM), and it demonstrates that higher dimensionality does not mean higher performance. In
Fig. 3(a), we can see that fine-tuning severely hurts performance on the training set in higher-dimensional
coding spaces; we suspect this is caused by overfitting in the complex model. In other words,
wrong clustering predictions deteriorate the clustering performance even further through
fine-tuning. That makes sense because we treat the wrong labels as correct ones in the
fine-tuning stage. It also confirms that it is reasonable to fine-tune only the top layer of the model,
instead of the whole network, in order to reduce overfitting. Fig. 3(b) shows
how the performance changes with the depth of the DBN structure, given 100 hidden nodes in the top layer.
It seems that a deeper, more complex model does not guarantee better performance.

To verify whether our NMMC is effective for data clustering and model selection, we also compare
our NMMC to DPM given the same DBN for feature learning. The results in Fig. 4 demonstrate
that NMMC outperforms DPM significantly, and also show that our NMMC converges
after 100 iterations. The time complexity comparison between our method and DPM in the DBN
projection space is shown in Fig. 5. It shows that our method is significantly more efficient than
DPM. To show how effective our method is, we also report the upper bound DBN+GMM,
with 2 layers, $n = [400, 100]$, in Table 1. It shows that features learned with DBN are helpful for
clustering, compared to raw data. It also shows that our method yields better clustering results
than the upper bound.

Clustering on 20 newsgroup: We also evaluated our model on the 20 newsgroup dataset for
document categorization. This document dataset has 20 categories and has been widely used in text
categorization and document classification. In the experiment, we tested our model on the binary
version of the 20 newsgroup dataset². We used the training set for training and tested the model
on the testing set. After we learned features with the DBN, we used NMMC for clustering,
setting $\alpha = 4$, $\lambda = 30$ and $C = 0.001$. To make a fair comparison, we took a similar
setting as on the MNIST dataset for both NMMC and DPM, so that the numbers of
clusters generated by the two methods are comparable. Baselines such as k-means and GMM should be
regarded as an upper bound because they are given the number of clusters ($K = 20$).

² http://www.cs.toronto.edu/~larocheh/public/datasets/20newsgroups/20newsgroups_{train,valid,test}_binary_5000_voc.txt

The clustering performance of our method (DBN+NMMC) on 20 newsgroups is shown in Table 2.
It also demonstrates that the fine-tuning process can greatly improve accuracy, especially on the
testing data. Although our model cannot beat the baselines on the training set, it achieves
better performance on the testing set (better than GMM and k-means clustering on the raw
data). To verify whether our NMMC is effective for data clustering and model selection, we
also compare our NMMC to DPM given the same DBN for feature learning. The results in Fig.
| Model | Rand Index (train) | Rand Index (test) | F-value (train) | F-value (test) |
|---|---|---|---|---|
| DBN+NMMC (pre-train, n = 100) | 0.363 ± 0.038 | 0.181 ± 0.07 | 0.442 ± 0.032 | 0.285 ± 0.063 |
| DBN+NMMC (fine-tune, n = 100) | 0.371 ± 0.039 | 0.392 ± 0.043 | 0.447 ± 0.033 | 0.467 ± 0.036 |
| DBN+NMMC (pre-train, n = [400,100]) | 0.419 ± 0.022 | 0.232 ± 0.09 | 0.483 ± 0.02 | 0.319 ± 0.07 |
| DBN+NMMC (fine-tune, n = [400,100]) | 0.428 ± 0.021 | 0.453 ± 0.02 | 0.492 ± 0.02 | 0.513 ± 0.016 |
| DBN+NMMC (pre-train, n = [400,400,100]) | 0.302 ± 0.017 | 0.218 ± 0.055 | 0.394 ± 0.014 | 0.317 ± 0.046 |
| DBN+NMMC (fine-tune, n = [400,400,100]) | 0.309 ± 0.015 | 0.326 ± 0.015 | 0.40 ± 0.012 | 0.415 ± 0.02 |
| DBN+NMMC (pre-train, n = [400,300,200,100]) | 0.334 ± 0.05 | 0.31 ± 0.08 | 0.423 ± 0.04 | 0.40 ± 0.07 |
| DBN+NMMC (fine-tune, n = [400,300,200,100]) | 0.34 ± 0.051 | 0.364 ± 0.054 | 0.433 ± 0.04 | 0.45 ± 0.045 |
| PCA+NMMC (n = 100) | 0.381 ± 0.02 | 0.251 ± 0.022 | 0.452 ± 0.02 | 0.353 ± 0.02 |
| IMRBM [17] (n = 100, K = 10) | 0.13 ± 0.04 | 0.10 ± 0.03 | 0.23 ± 0.02 | 0.22 ± 0.02 |
| k-means (K = 10) | 0.356 ± 0.029 | 0.367 ± 0.03 | 0.446 ± 0.026 | 0.451 ± 0.026 |
| GMM (K = 10) | 0.356 ± 0.029 | 0.394 ± 0.04 | 0.446 ± 0.025 | 0.465 ± 0.026 |
| Spectral Clustering (K = 10) | 0.354 ± 0.057 | 0.359 ± 0.035 | 0.423 ± 0.045 | 0.423 ± 0.03 |
| DBN + k-means (K = 10) | 0.411 ± 0.016 | 0.316 ± 0.027 | 0.473 ± 0.015 | 0.401 ± 0.019 |
| DBN + GMM (K = 10) | 0.411 ± 0.016 | 0.406 ± 0.022 | 0.473 ± 0.015 | 0.467 ± 0.024 |

Table 1: The experimental comparison on the MNIST dataset, where “train” means the training data,
“test” indicates the testing data, and n specifies the number of hidden variables for each layer (for
example, n = [400,100] indicates that the DBN has two layers, where the first layer has 400 hidden nodes and the
second layer has 100 hidden nodes). For PCA+NMMC, we first use PCA to project the data into 100
dimensions, then perform NMMC for clustering. It demonstrates that the fine-tuning process in
our model can improve clustering performance greatly, and that our method (DBN+NMMC) beats the
baselines remarkably when n = [400,100].



Figure 4: The performance comparison between DPM and NMMC on the MNIST dataset with the
same DBN structure for feature learning, plotted against the number of iterations: (a) a 1-layer DBN
(or RBM) with the number of hidden nodes $n = 100$; (b) a 2-layer DBN, with $n = [400, 100]$ for
each layer. It demonstrates that with the same DBN for feature learning, NMMC outperforms DPM remarkably.

6 demonstrate that NMMC outperforms DPM remarkably. To test how time complexity changes
with respect to the number of dimensions in the projected space, we tried different coding spaces
and compared our method with DPM, with results shown in Fig. 5. Again, it demonstrates that our
Figure 5: The complexity comparison between DPM and NMMC in the data projection space: (a)
shows how the time complexity varies with the number of training data on the MNIST dataset,
under the 1-layer DBN with 100 hidden nodes; (b) shows how the time complexity changes with
the number of hidden nodes on the 20 newsgroup dataset, under the 1-layer DBN. It shows that
our method is more efficient than DPM for data clustering.


| Model | Rand Index (train) | Rand Index (test) | F-value (train) | F-value (test) |
|---|---|---|---|---|
| DBN+NMMC (pre-train, n = 200) | 0.059 ± 0.02 | 0.034 ± 0.016 | 0.131 ± 0.017 | 0.11 ± 0.012 |
| DBN+NMMC (fine-tune, n = 200) | 0.069 ± 0.023 | 0.065 ± 0.025 | 0.142 ± 0.019 | 0.141 ± 0.02 |
| DBN+NMMC (pre-train, n = [1000,200]) | 0.048 ± 0.014 | 0.028 ± 0.007 | 0.109 ± 0.005 | 0.098 ± 0.007 |
| DBN+NMMC (fine-tune, n = [1000,200]) | 0.047 ± 0.015 | 0.043 ± 0.013 | 0.108 ± 0.006 | 0.104 ± 0.004 |
| PCA+NMMC (n = 200) | 0.036 ± 0.005 | 0.016 ± 0.012 | 0.11 ± 0.005 | 0.087 ± 0.010 |
| IMRBM [17] (n = 200, K = 20) | 0.015 ± 0.005 | 0.013 ± 0.002 | 0.096 ± 0.004 | 0.093 ± 0.004 |
| k-means (K = 20) | 0.075 ± 0.02 | 0.032 ± 0.004 | 0.140 ± 0.019 | 0.109 ± 0.016 |
| GMM (K = 20) | 0.075 ± 0.021 | 0.051 ± 0.006 | 0.140 ± 0.019 | 0.114 ± 0.016 |
| Spectral Clustering (K = 20) | 0.058 ± 0.02 | 0.061 ± 0.017 | 0.126 ± 0.013 | 0.129 ± 0.006 |
| DBN + k-means (K = 20) | 0.237 ± 0.007 | 0.06 ± 0.036 | 0.279 ± 0.008 | 0.119 ± 0.026 |
| DBN + GMM (K = 20) | 0.239 ± 0.009 | 0.125 ± 0.056 | 0.281 ± 0.006 | 0.185 ± 0.045 |


Table 2: The experimental comparison on the 20 newsgroup dataset, where “train” means the
training data and “test” indicates the testing data. It demonstrates that the fine-tuning process in our
model can improve clustering performance. We compare the performance of our method
with that of the baselines. It demonstrates that our method (DBN+NMMC) yields clustering accuracy
comparable to the baselines, and performs better on the testing set with a 1-layer DBN.


method is more efficient in practice. 

To sum up, the experiments above show that our model converges well after 100 iterations. Moreover,
the fine-tuning process in our model can greatly improve the performance on the test sets. Thus, it
also shows that the parameters learned with NMMC can be embedded well in the deep structures.
Figure 6: The performance comparison between DPM and NMMC with the same DBN structure
for feature learning on 20 newsgroups: (a) a 1-layer DBN (or RBM) with the number of hidden
nodes $n = 200$; (b) a 2-layer DBN, with $n = [1000, 200]$ for each layer. It demonstrates that
with the same DBN for feature learning, NMMC outperforms DPM remarkably.

5 Conclusion

Clustering is an important problem in machine learning, and its performance depends heavily on
data representation. How to adapt the model complexity to the data also poses a challenge.
In this paper, we propose a deep belief network with nonparametric maximum margin clustering.
This approach is inspired by recent advances of deep learning for representation learning. As
an unsupervised method, our model leverages deep learning for feature learning and dimension
reduction. Moreover, our approach with nonparametric maximum margin clustering (NMMC) is a
discriminative clustering method, which can adapt the model size automatically as the data grows. In
addition, the fine-tuning process can incorporate NMMC well into the deep structures. Thus, our
approach can learn features for clustering and infer model complexity in a unified framework. We
currently use DBN instead of deep autoencoders [9] for fast feature learning because the latter
are time-consuming for dimension reduction. In future work, we will explore deep autoencoders to
learn better feature representations for clustering analysis. Another interesting topic to be explored
is how to optimize the depth of deep learning structures in order to improve clustering performance.